Abstract

To  transform  a  sequential  program  into  a  concurrent  program,  a  compiler  typically  divides 
the  sequential  program  into  a  partially-ordered  set  of  basic  blocks,  allowing  unrelated  blocks 
to  execute  concurrently.  Blocks  may  execute  concurrently  only  if  there  are  no  dependencies 
among  them,  and  therefore  a  compiler  can  introduce  concurrency  only  to  the  extent  that  it 
can guarantee the absence of dependencies. A limitation of this technique is that it is
necessarily conservative: it may be difficult or impossible to prove the absence of dependencies
even  when  no  dependencies  exist. 

This paper investigates optimistic parallelization, a complementary technique for parallelizing
sequential code. Blocks with potential conflicts are allowed to execute in parallel, and
conflicts are detected at run-time. When a conflict is detected, the conflicting blocks are
rolled back and re-executed in sequential order. Optimistic parallelization can enhance
concurrency when the compiler cannot prove the absence of dependence among independent
blocks, and when dependencies occur, but are sufficiently rare.

We show how conflict detection and roll-back can be accomplished efficiently through relatively
simple changes to the caches and the cache-coherence protocol of a shared-memory
multiprocessor. We then show how a compiler might exploit these mechanisms when parallelizing
programs. Finally, using simulation results, we show that optimistic parallelization
using our mechanisms can give good performance.
1  Introduction


To  transform  a  sequential  program  into  a  concurrent  program,  a  compiler  typically  divides 
the sequential program into a partially-ordered set of basic blocks (single-entry single-exit
code  blocks),  allowing  unrelated  blocks  to  execute  concurrently.  For  example,  consider  the 
following  loop: 

for (i = 0; i < N; i++)
    A[i] = 2 * B[i];

Each  iteration  of  this  loop  might  be  considered  a  basic  block,  and  these  blocks  might  be 
distributed among the processors on a multiprocessor as follows:

for (i = 0 + PROCESSOR; i < N; i += NO_OF_PROCESSORS)
    A[i] = 2 * B[i];

This transformation, however, fails to preserve correctness if any of the following data
dependencies  occur: 

•  flow-dependency:  an  earlier  iteration  writes  a  variable  read  by  a  later  iteration. 

•  anti-dependency:  an  earlier  iteration  reads  a  variable  written  by  a  later  iteration. 

•  output-dependency:  two  iterations  write  to  the  same  variable. 
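The three kinds of dependency can be illustrated with small loops. The following is a sketch only: the function names are hypothetical and do not appear in the paper. Each loop is correct when its iterations run in order, but distributing the iterations among processors as above could change the result.

```c
#include <assert.h>
#include <stddef.h>

/* Flow dependency: iteration i reads a[i-1], which iteration i-1 wrote. */
void flow_dep(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + 1;
}

/* Anti-dependency: iteration i reads a[i+1], which iteration i+1
 * overwrites afterwards. */
void anti_dep(int *a, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = a[i + 1] * 2;
}

/* Output dependency: every iteration writes the same variable, so the
 * final value depends on which iteration writes last. */
void output_dep(const int *a, size_t n, int *last)
{
    assert(last != NULL);
    for (size_t i = 0; i < n; i++)
        *last = a[i];
}
```

In the sequential order, `flow_dep` on four zeros produces 0, 1, 2, 3; any schedule in which iteration 3 runs before iteration 2 would produce a different array.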

A  compiler  can  introduce  concurrency  only  to  the  extent  that  it  can  guarantee  the  absence 
of such dependencies. The Banerjee test [2] is the basis for most compile-time techniques
for proving the absence of data dependencies (for example, see [4, 14, 15]).

A limitation of such techniques is that they are necessarily conservative: it may be difficult
or impossible to prove the absence of dependencies even when no dependencies exist. For
example, the procedure permute of Figure 1 has an output dependency if there are two
index values, i1 and i2, such that B[i1] = B[i2]. A compiler may parallelize this loop only
if it can establish that for every invocation of permute, no such i1 and i2 exist. In addition,
the  compiler  must  ensure  that  the  arrays  A  and  B  do  not  overlap  in  memory,  nor  do  A  and 
C .  Proving  such  properties  is  undecidable  in  general,  and  often  difficult  or  impossible  in 
practice. 

This paper investigates optimistic parallelization, a complementary technique for introducing
concurrency into sequential programs. Blocks with potential conflicts are allowed to
execute in parallel, and conflicts are detected at run-time. When a conflict is detected,
the conflicting blocks are rolled back and re-executed in sequential order. Optimistic parallelization
can enhance concurrency in circumstances when the compiler cannot prove the
absence  of  dependence  among  independent  blocks,  and  when  dependencies  do  occur,  but 
are  sufficiently  rare.  Optimistic  parallelization  does  not  exclude  the  use  of  conventional 
methods  when  absence  of  conflict  is  detectable. 

void permute()
{
    int i;

    for (i = 0; i < N; i++)
        A[B[i]] = C[i];
}

Figure 1: Problematic Loop

The premise of this paper is that simple hardware support can make optimistic parallelization
an effective technique for introducing parallelism into certain kinds of sequential
programs. Optimistic concurrency control, in one form or another, is an old idea. Our
contribution is to explore the feasibility of optimistic methods in a specific context: parallelizing
sequential code on shared-memory multiprocessors. Optimistic techniques can
provide adequate performance only if:

•  Data  conflicts  are  sufficiently  rare. 

•  Conflict  detection  is  sufficiently  inexpensive. 

•  Roll-back  is  sufficiently  fast. 

In this paper, we focus on the last two issues. To make conflict detection and roll-back fast,
we  propose  a  set  of  simple  modifications  to  standard  caches  and  cache  consistency  protocols. 
To  test  our  approach,  we  hand-compiled  a  number  of  programs  found  in  the  literature,  and 
ran  them  on  a  simulated  multiprocessor  incorporating  our  modifications.  Our  results  are 
promising; we were able to speed up a number of applications, some substantially. We
believe  that  optimistic  techniques  merit  further  study. 


2  Architecture 

This  section  describes  the  basic  architectural  support  needed  for  optimistic  parallelization. 
The  description  is  given  in  terms  of  transactions  on  the  shared  memory.  In  later  sections 
we will introduce additional refinements to the proposed mechanisms.

A  transaction  is  a  finite  sequence  of  machine  instructions,  executed  by  a  single  process, 
satisfying the following properties:

•  Serializability. Transactions appear to execute in a serial, one-at-a-time order.

•  Atomicity.  Each  transaction  makes  a  sequence  of  tentative  changes  to  shared  memory. 
When the transaction completes, it either commits, making its changes visible to the
other  processes,  or  it  aborts,  causing  its  changes  to  be  discarded. 

Whenever the compiler is unable to establish the absence of dependencies among concurrent
blocks, those blocks are executed as transactions. The notion of a transaction originated
in the database literature (see [7]). Unlike database transactions, which may access large
amounts of data residing on a disk, our transactions are short-lived activities that access
a relatively small number of memory locations in primary memory. Concurrent database
transactions may (usually) be serializable in any order, but our transactions must be
serializable in the order of their corresponding basic blocks.

In addition to the usual set of instructions, the architecture provides the following transactional
instructions:

•  Trans-Read  reads  the  value  of  a  shared  variable  into  a  local  variable. 

•  Trans-Write  tentatively  writes  the  value  in  a  private  variable  to  a  shared  variable. 
This  new  value  does  not  become  visible  to  other  processors  until  the  transaction 
successfully  commits  (see  below). 

•  Commit  attempts  to  make  the  transaction’s  tentative  changes  permanent.  It  succeeds 
only  if  no  other  transaction  has  updated  any  location  in  the  transaction’s  read  or 
write  set,  and  no  other  transaction  has  read  any  location  in  this  transaction’s  write 
set.  If  it  succeeds,  the  transaction’s  changes  to  its  write  set  become  visible  to  other 
processes. If it fails, all changes to the write set are discarded. Either way, Commit
returns  an  indication  of  success  or  failure. 

•  Abort discards all updates to the write set.
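The semantics of these four instructions can be modeled in software. The sketch below is a single-threaded model written for illustration, not the cache-based implementation of [8]. All names (`txn`, `trans_read`, `MEM_WORDS`, and so on) are hypothetical, and per-location version counters stand in for the coherence protocol. One simplification is assumed: the model detects only updates by other committed transactions to locations in the read or write set, and does not detect that another transaction has read a location in the write set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MEM_WORDS 16   /* size of the modeled shared memory (assumption) */
#define MAX_SET   8    /* capacity of the read and write sets (assumption) */

static int      mem[MEM_WORDS];      /* committed values */
static unsigned version[MEM_WORDS];  /* incremented by each committed write */

typedef struct {
    size_t   raddr[MAX_SET];  /* read set: addresses ...           */
    unsigned rver[MAX_SET];   /* ... and the versions observed     */
    size_t   nreads;
    size_t   waddr[MAX_SET];  /* write set: addresses ...          */
    unsigned wver[MAX_SET];   /* ... versions observed ...         */
    int      wval[MAX_SET];   /* ... and tentative values          */
    size_t   nwrites;
} txn;

void txn_begin(txn *t) { t->nreads = 0; t->nwrites = 0; }

/* Trans-Read: return this transaction's own tentative value if it has
 * written the location, otherwise the committed value. */
int trans_read(txn *t, size_t addr)
{
    assert(addr < MEM_WORDS && t->nreads < MAX_SET);
    for (size_t i = t->nwrites; i-- > 0; )
        if (t->waddr[i] == addr)
            return t->wval[i];
    t->raddr[t->nreads] = addr;
    t->rver[t->nreads++] = version[addr];
    return mem[addr];
}

/* Trans-Write: buffer the value; nothing is visible until commit. */
void trans_write(txn *t, size_t addr, int val)
{
    assert(addr < MEM_WORDS && t->nwrites < MAX_SET);
    t->waddr[t->nwrites] = addr;
    t->wver[t->nwrites] = version[addr];
    t->wval[t->nwrites++] = val;
}

/* Abort: discard all tentative updates. */
void trans_abort(txn *t) { txn_begin(t); }

/* Commit: succeed only if no location in the read or write set has been
 * updated since it was first touched; then publish the write set. */
bool trans_commit(txn *t)
{
    bool ok = true;
    for (size_t i = 0; i < t->nreads && ok; i++)
        ok = version[t->raddr[i]] == t->rver[i];
    for (size_t i = 0; i < t->nwrites && ok; i++)
        ok = version[t->waddr[i]] == t->wver[i];
    for (size_t i = 0; i < t->nwrites && ok; i++) {
        mem[t->waddr[i]] = t->wval[i];
        version[t->waddr[i]]++;
    }
    txn_begin(t);   /* on failure this discards the write set */
    return ok;
}
```

If two transactions both read location 0 and then write it, whichever commits first succeeds and the other's commit fails, leaving only the first transaction's value in memory.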

This architecture is a simplified version of transactional memory, a cache structure proposed
by Herlihy and Moss [8]. A complete description of the transactional memory implementation
is beyond the scope of this abstract (see [8] for details). For now, we remark that
transactional memory is implemented by modifying standard ownership-based cache consistency
protocols. It requires a small, fully-associative transactional cache in addition to the
regular cache. Non-transactional operations use the same caches, cache controller logic, and
consistency protocols they would have used in the absence of transactional memory. Custom
hardware support is restricted to caches and their controllers; transactional memory
requires no other changes to standard processor architectures.


3  Using  Transactional  Memory 


In  this  section  we  show  how  to  use  transactional  memory  for  optimistic  parallelization,  and 
we propose some simple extensions for efficiency.

Here  is  a  first  attempt  at  parallelizing  the  loop  from  Figure  1: 

for (i = 0 + PROCESSOR; i < N; i += NO_OF_PROCESSORS) {
restart:
    t1 = Trans_Read(&C[i]);
    t2 = Trans_Read(&B[i]);
    Trans_Write(&A[t2], t1);
    while (counter != i) /* wait */ ;
    if (!Trans_Commit()) {
        backoff();
        goto restart;
    }
    Increment(&counter);
}


for (i = 0 + PROCESSOR; i < N; i += NO_OF_PROCESSORS) {
    Set_Priority_Register(N - i);
restart:
    t1 = Trans_Read(&C[i]);
    t2 = Trans_Read(&B[i]);
    Trans_Write(&A[t2], t1);
    while (counter != i) /* wait */ ;
    if (!Trans_Commit())
        goto restart;
    Increment(&counter);
}

Figure 2: Optimistically Parallelized Loop


If  the  computation  terminates,  the  proper  serialization  is  observed.  Unfortunately,  this 
simple  translation  can  lead  to  livelock.  If  a  later  iteration  writes  to  a  location  before  an 
earlier iteration accesses it, then the earlier iteration’s transaction will continually abort
and  restart,  while  the  later  transaction  will  wait  forever  for  the  counter  to  be  incremented. 

3.1  Priority  Registers 

Effective  parallelization  requires  that  later  iterations  be  aborted  in  preference  to  earlier 
iterations.  To  this  end,  we  augment  transactional  memory  as  follows.  Each  processor  is 
given  a  priority  register  that  holds  a  fixed-size  value.  When  a  conflict  occurs,  the  cache 
consistency  protocol  aborts  the  transaction  whose  register  holds  the  lesser  value.  If  the 
values  are  the  same,  the  original  semantics  of  transactional  memory  is  preserved  (either 
transaction  can  be  aborted). 

The  example  loop  might  now  be  parallelized  as  shown  in  Figure  2.  Earlier  iterations 
have  higher  priority  than  later  iterations,  and  if  a  conflict  occurs,  the  earlier  iteration 
will  progress.  The  transaction  with  the  highest  priority  will  never  be  aborted  by  a  data 
conflict. 
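The arbitration rule can be written down directly. The following is a minimal sketch with hypothetical names (`victim`, `choose_victim`, `iteration_priority`); in the proposed architecture the comparison would be made by the cache consistency protocol, not by software.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ABORT_FIRST, ABORT_SECOND, ABORT_EITHER } victim;

/* Given the priority registers of two conflicting transactions, pick the
 * one to abort: the lesser value loses, and a tie leaves the choice open. */
victim choose_victim(uint32_t prio_a, uint32_t prio_b)
{
    if (prio_a < prio_b)
        return ABORT_FIRST;
    if (prio_b < prio_a)
        return ABORT_SECOND;
    return ABORT_EITHER;
}

/* Priority that Figure 2 loads for iteration i of an n-iteration loop. */
uint32_t iteration_priority(uint32_t n, uint32_t i)
{
    assert(i < n);
    return n - i;
}
```

With N = 8, iteration 2 holds priority 6 and iteration 5 holds priority 3, so a conflict between them aborts iteration 5.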

As an additional optimization, if a lower-priority transaction is about to write to a variable
read or written by a higher-priority transaction, it can simply be stalled until the
higher-priority transaction commits or aborts.

When a priority register is about to overflow, we can re-normalize the blocks’ priorities,
giving the “earliest” transaction the highest available priority, or processors can re-normalize
priorities on the fly by keeping track of the highest and lowest priority transactions in the
system.

3.2  Exceptions 

Exceptions  such  as  an  address  fault  or  divide  by  zero  can  be  handled  as  follows.  When  the 
processor  receives  an  exception,  it  delays  the  transaction  until  its  priority  is  greater  than 
or equal to any other transaction’s priority. At that point, it attempts to checkpoint the
block by committing its partially completed transaction. Because there are no higher-priority
transactions,  checkpointing  the  block  will  not  cause  any  data  conflicts.  If  the  commit  is 
successful,  then  it  handles  the  exception,  and  then  starts  a  new  transaction  to  execute  the 
remainder  of  the  block.  If  the  commit  is  unsuccessful,  it  handles  the  exception  and  restarts. 

As mentioned above, transactional memory uses a small transactional cache for conflict
detection  and  recovery.  Transactional  cache  overflow  might  be  avoided  if  the  size  of  the 
transactional  cache  is  available  to  the  compiler,  but  it  is  preferable  to  compile  programs 
in  a  configuration-independent  way.  When  the  transactional  cache  is  about  to  overflow,  it 
simply  sends  an  interrupt  back  to  the  processor,  and  the  interrupt  is  handled  as  described 
above. 

4  General  Control  Constructs 

So far, we have considered only loops. In this section, we sketch a general transformation
for  other  control  constructs,  with  special  attention  to  procedure  calls. 

At or near the beginning of each basic block, a thread is forked to execute the “next” block.
The earlier block is given higher priority than the later block (and any blocks forked by the
later block). Branch prediction techniques [18] may be needed to guess which block will be
next. If the guess is wrong, then the speculative block (and the blocks it forked) can be
aborted and rolled back.

For certain constructs, additional control information can be used by the compiler to schedule
basic blocks earlier or to avoid aborting work that is always necessary. As an example,
consider an if-statement:

if (E1)
    S1;
else
    S2;
S3;

The compiler can arrange to have E1, S1, S2, and S3 evaluated in parallel as transactions.
Once the evaluation of E1 is complete and the direction of the branch is determined, either
S1 or S2 can be forced to abort. However, since control flow must always go through S3, it
does not need to be aborted unless there is a data conflict with one of the earlier statements.

Later blocks may need values computed by earlier blocks (e.g., a loop index). These values
can  be  stored  in  shared  variables.  If  we  can  determine  statically  that  a  block  depends  on  a 
value  computed  by  an  earlier  block,  then  we  can  delay  forking  the  later  block  until  the  value 
is  computed.  If  we  cannot  determine  statically  whether  such  a  dependency  exists,  the  later 
block  can  read  the  variable  transactionally,  ensuring  that  if  an  earlier  block  updates  the 
variable,  the  later  block  will  be  aborted  and  restarted.  Values  such  as  loop  indices  should 
be  calculated  as  soon  as  possible. 

Calling procedures in parallel requires using cactus stacks for activation frames, or allocating
the  frames  from  the  heap.  For  recursive  procedures,  it  is  not  always  possible  to  assign 
priorities  statically.  For  example,  consider  the  following  binary-tree  traversal. 


void traverse(tree t)
{
    if (t != NULL) {
        traverse(t->left);
        traverse(t->right);
    }
}

To execute the traversal in parallel, we must assign priorities so that all nested calls traversing
the left-hand subtree have higher priority than any call traversing the right-hand subtree.
Such priorities can be assigned dynamically by allocating each invocation a range of
priorities,  where  every  priority  allocated  to  a  left-hand  call  is  higher  than  any  value  in  the 
allocated  range  of  the  right-hand  call.  The  depth  to  which  recursive  calls  can  be  parallelized 
is  limited  by  the  size  of  the  priority  register. 
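One possible range-splitting scheme is sketched below. It is an illustration under stated assumptions, not the paper's allocator: the `node` type, the `prio` field, and the `assign` function are hypothetical, and the even split of the range is an arbitrary choice. Each invocation takes the top value of its range for itself, hands the upper half of the remainder to the left-hand call and the lower half to the right-hand call, and reports failure when the range runs out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    struct node *left, *right;
    uint32_t prio;              /* priority given to traverse(this node) */
} node;

/* Assign priorities from the closed range [lo, hi]. Returns 1 on success
 * and 0 if the range is exhausted, i.e. the recursion is too deep to be
 * parallelized with the priorities available. */
int assign(node *t, uint32_t lo, uint32_t hi)
{
    uint32_t mid;

    if (t == NULL)
        return 1;
    if (lo > hi)
        return 0;
    t->prio = hi;                       /* the call itself runs first */
    if (hi == lo)                       /* nothing left for children  */
        return t->left == NULL && t->right == NULL;
    mid = lo + (hi - 1 - lo) / 2;       /* split the remainder [lo, hi-1] */
    return assign(t->left, mid + 1, hi - 1)   /* upper half: left call  */
        && assign(t->right, lo, mid);         /* lower half: right call */
}
```

Because each level halves the range, a 32-bit register supports roughly 32 levels of recursion under this scheme, which is one concrete form of the depth limit noted above.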


5  Simulation  Results 

Our  basic  premise  is  that  support  for  extended  transactional  memory  can  make  optimistic 
parallelization  effective.  To  test  this  hypothesis,  we  hand-compiled  a  number  of  programs 
found in the literature, and ran them on a simulated multiprocessor incorporating our modifications.
Each program was parallelized using both conventional and optimistic techniques.

We  kept  our  experiments  as  simple  as  possible.  All  processors  were  started  at  the  beginning 
of  the  program,  executing  the  same  code.  For  “pessimistic”  parallelization,  we  parallelized 
loops for which the absence of dependencies was easy to prove statically. The processors
executed the iterations of the loop in round-robin fashion with no synchronization. All other
code was executed by a single processor. Minimal barrier synchronizations were inserted
between the serial and parallelized code to prevent race conditions. No other parallelization
techniques (such as renaming to eliminate dependencies or combining trees for associative
operations) were used.

For  “optimistic”  parallelization,  we  parallelized  loops  for  which  it  was  not  apparent  whether 
data dependencies would exist at runtime or not. The reads and writes were converted to
transactional reads and writes and a shared counter was used to force the proper serialization.
At the end of the loop, a barrier synchronization was performed and the shared
variable  was  reset. 

5.1  Proteus 

Our  programs  were  simulated  using  a  version  of  the  Proteus  [3]  simulator  which  we  modified 
to  support  enhanced  transactional  memory.  Proteus  is  an  execution-driven  simulator  system 
for  multiprocessors  developed  by  Eric  Brewer  and  Chris  Dellarocas  of  MIT.  The  program 
to be simulated is written in a superset of C. References to shared memory trap to the
simulator,  and  other  instructions  are  executed  directly,  augmented  by  cycle-counting  code 
inserted  by  a  preprocessor.  Because  most  of  the  program  is  executed  directly  by  the  host 
processor, large simulations can be run relatively quickly. Proteus does not capture the
effects  of  instruction  caches  or  local  caches. 

The simulator can be configured to support a variety of multiprocessor architectures. We
started with a bus-based, sequentially consistent, cache-coherent architecture using the
Goodman Snoopy-cache protocol [6]. We augmented the cache simulation to support transactional
memory and most of our enhancements for optimistic parallelization.

The following parameters describe the machine(s) that we simulated: A maximum of 8
processors  were  used.  Each  processor  had  a  direct-mapped  64  kilobyte  cache  with  a  line 
size  of  two  32-bit  words.  The  transactional  caches  associated  with  each  processor  held  64 
lines.  Each  processor  had  a  32-bit  priority  register.  Cache  access  latency  was  1  cycle  while 
memory  access  latency  was  4  cycles. 
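These parameters imply a few derived quantities that are useful when reasoning about transaction sizes. The arithmetic below is a sketch: the constant and function names are hypothetical, and a kilobyte is assumed to be 1024 bytes.

```c
#include <assert.h>
#include <stddef.h>

enum {
    WORD_BYTES      = 4,                          /* 32-bit words          */
    LINE_WORDS      = 2,                          /* two words per line    */
    LINE_BYTES      = WORD_BYTES * LINE_WORDS,    /* 8 bytes               */
    CACHE_BYTES     = 64 * 1024,                  /* 64 KB, assumed 2^16   */
    CACHE_LINES     = CACHE_BYTES / LINE_BYTES,   /* 8192 lines            */
    TXN_CACHE_LINES = 64,
    TXN_CACHE_BYTES = TXN_CACHE_LINES * LINE_BYTES  /* 512 bytes           */
};

static_assert(CACHE_BYTES % LINE_BYTES == 0, "cache holds whole lines");

/* Index of the direct-mapped cache line that holds a byte address. */
size_t cache_index(size_t byte_addr)
{
    return (byte_addr / LINE_BYTES) % CACHE_LINES;
}

/* Upper bound on the distinct 32-bit words a single transaction can touch
 * before the transactional cache overflows. */
size_t max_txn_words(void)
{
    return (size_t)TXN_CACHE_LINES * LINE_WORDS;
}
```

Under these assumptions the regular cache holds 8192 lines, and a transaction can touch at most 128 words (fewer if its accesses are scattered across lines) before triggering the overflow interrupt described in Section 3.2.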

5.2  Simulation  Results 

The  programs  that  we  chose  to  simulate  are  drawn  from  a  range  of  applications.  They  are 
necessarily small since parallelization had to be done by hand. We chose published programs
and algorithms as our baseline sequential code and concentrated on programs
whose control was loop-based and for which optimistic parallelization looked promising.
Data  sets  were  chosen  essentially  at  random.  Since  Proteus  is  entirely  deterministic,  the 
same  data  sets  were  used  for  all  versions  of  a  program. 

For  each  program,  we  give  figures  which  show  the  speedup  for  the  optimistically  parallelized 
code and, if applicable, the pessimistically parallelized code. In the following paragraphs,
we give a brief description of each of the programs and attempt to explain these performance
results. 

The knapsack program is taken from [17, page 596] and uses dynamic programming. The
goal is to maximize the value of the elements that can be placed into a bag of fixed capacity,
where each element has an associated size and value. As in most dynamic programming
problems, there is a potential flow-dependency between earlier and later iterations. Thus,
the core of the program cannot be parallelized using conventional techniques. However,
conflicts are dependent on the data, so the core can be optimistically parallelized. Figure 3
shows the speedup for the optimistically parallelized code. As a data set, we used a fixed-capacity
knapsack, a fixed number of items, and random sizes and values for each different
item. The sizes of the items were restricted to be a small fraction of the capacity of the
knapsack.

The convex program is taken from [17] and is a simple O(n^2) program that
finds the convex hull of a set of randomly placed points. Essentially, one point known to
be on the hull is chosen and points with the minimum angle from the last chosen point
are successively added. Neither of the main loops can be parallelized using conventional
techniques. We chose to parallelize the inner loop optimistically. Figure 4 shows the speedup
for the optimistically parallelized code.

The radix program is a radix sort of 1024 32-bit random values, using an 8-bit radix. The
program is taken from [17, page 140]. The program has an outer loop that is executed a
small number of times (in this case 4) and nested within it are five separate loops. Only
two of these loops may be conventionally parallelized given our compilation model. An
additional loop can be optimistically parallelized. Figure 5 shows the speedup for both the
pessimistically and optimistically parallelized code.

The solver program is a simple O(n^3) program which solves a set of n linear equations
as suggested in [17] [...] optimizations added. The program consists of two phases, an elimination phase
which drives a matrix to an upper-triangular form, and a substitute phase which calculates
[...]. The innermost loop of the elimination phase can be pessimistically
parallelized [...] the second-most-inner loop. The
substitute phase has a definite data dependency across its inner-most loop and cannot be
pessimistically parallelized. Even though the dependency is definite, we applied optimistic
parallelization since some work can still be done in parallel. Figure 6 shows the speedup for
both the pessimistically and optimistically parallelized code. Note that the pessimistic code
outperforms the optimistic code for more than 1 processor. We suspect that this is due to
the large transaction size of the loops that were chosen to be parallelized (see below).

Some general conclusions can be drawn from these benchmarks. For instance, in all of
the benchmarks, the overhead of doing a loop transactionally becomes acceptable as soon
as more than one processor is used. However, speedup seems to level off at about 5 [...]
parallelism in the programs and, more importantly, only loops were identified for parallelization
[...] in the translations. A portion
of this inefficiency is due to the naive implementation of the barrier synchronizations and
the shared location used to serialize transactions. This can be alleviated by using more
sophisticated techniques such as counting networks or sense-reversing barriers. However,
such techniques can increase the latency of operations, making them unattractive for small
numbers of processors.

Some  of  the  inefficiency  in  the  translation  is  due  to  the  small  granularity  of  the  work  done 
by  each  processor.  Small  transactions  make  it  difficult  to  amortize  the  cost  of  doing  the 
synchronizations.  For  instance,  the  transactions  of  the  knapsack  program  are  only  4  lines 
of  code.  The  size  of  transactions  can  be  increased  by  using  techniques  such  as  strip-mining 
or  loop-unrolling.  However,  making  a  transaction  larger  increases  the  probability  that  a 
conflict  will  occur  with  some  other  transaction,  especially  as  the  number  of  processors 
grows.  Larger  transactions  also  require  larger  transactional  caches. 
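As a rough illustration of strip-mining, the helper functions below compute which iterations fall into each transaction when several consecutive iterations are grouped together. This is a sketch with hypothetical names (`strip`, `num_strips`, `strip_bounds`, `strip_seq`); it assumes strips are dealt to processors round-robin, as single iterations are in Figure 2, and it is not taken from the simulated programs.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t begin, end; } strip;   /* iterations [begin, end) */

/* Number of transactions needed to cover n iterations. */
size_t num_strips(size_t n, size_t strip_len)
{
    assert(strip_len > 0);
    return (n + strip_len - 1) / strip_len;
}

/* Iteration range of the strip with global sequence number seq; the
 * range is empty once seq runs past the end of the loop. */
strip strip_bounds(size_t n, size_t strip_len, size_t seq)
{
    strip s;
    assert(strip_len > 0);
    s.begin = seq * strip_len < n ? seq * strip_len : n;
    s.end = s.begin + strip_len < n ? s.begin + strip_len : n;
    return s;
}

/* Sequence number of the k-th strip executed by processor proc when
 * strips are dealt round-robin to nproc processors. */
size_t strip_seq(size_t nproc, size_t proc, size_t k)
{
    assert(proc < nproc);
    return k * nproc + proc;
}
```

In such a scheme the sequence number, rather than the iteration index, would be compared against the shared counter before committing, so each commit and counter increment is amortized over `strip_len` iterations.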

To demonstrate the effect that transaction size can have on execution time, we ran the
radix  program,  varying  the  number  of  loop-iterations  per  transaction  and  the  number 
of  processors.  Results  for  these  tests  are  shown  in  Figure  7.  As  expected,  very  small 
transactions  (one  iteration)  give  poor  performance  with  any  number  of  processors.  Larger 
transactions  give  better  performance  for  small  numbers  of  processors,  but  performance  can 
degrade  for  large  transactions  and  large  numbers  of  processors. 


6  Related  Work 

There  is  a  vast  literature  on  optimistic  techniques  for  database  synchronization.  The  two 
earliest  and  most  influential  papers  are  by  Thomas  [10]  and  by  Kung  and  Robinson  [12]. 

Knight  [11]  proposed  an  architecture  in  which  basic  blocks  were  scheduled  to  run  in  parallel 
with  transactional  semantics.  He  also  proposed  the  use  of  a  shared  counter  to  force  the 
proper  serialization.  The  ParaTran  System  [10,  20]  applied  these  ideas  in  an  optimistically 
parallelizing compiler for Scheme. ParaTran used software techniques adapted from the
database literature for conflict detection and recovery.

Figure 7: Varying Transaction Sizes in Radix Sort

Franklin and Sohi [5] propose a hardware architecture that optimistically parallelizes code
at  runtime.  Processors  execute  basic  blocks  in  parallel.  Serialization  is  guaranteed  by 
organizing  processors  in  a  queue,  and  branch  prediction  is  used  to  determine  the  next  basic 
block.  A  “future  file”  is  used  to  forward  register  values  from  one  processing  element  to  the 
next,  and  an  “address  resolution  buffer”  detects  conflicts.  Franklin  and  Sohi  ran  simulations 
of  real  programs  on  this  architecture  and  observed  substantial  speedups. 

Although Franklin and Sohi’s architecture resembles ours in several respects, it has two limitations.
First, it is a radical departure from traditional architectures, requiring a complete
change of the processing elements. Our architecture requires modest changes to caches and
their controllers, and support for a few new instructions. Second, in their architecture,
processors are forced to execute a serial stream of instructions, while in our scheme, a multiprocessor
may still execute independent instruction streams. Recent studies [1, 13, 21]
have  shown  that  taking  advantage  of  independent  instruction  streams  can  have  a  significant 
impact  on  performance. 

In [9], Larus and Huelsbergen propose two techniques which support dynamic program
parallelization. Dynamic parallelization, like optimistic parallelization, can find more parallelism
than static analysis by using runtime information. Unlike optimistic parallelization,
dynamic  parallelization  first  tests  to  see  if  it  is  safe  to  run  a  parallelized  version  of  code  and  if 
not,  falls  back  on  a  sequential  version  of  the  code.  Obviously,  testing  to  see  if  parallelization 
can  be  done  must  be  cheap.  Thus,  these  tests  are  usually  conservative  approximations. 
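To make the contrast concrete, here is what a test-then-run scheme might look like for the permute loop of Figure 1. This is an illustration only and is not taken from [9]: the names `indices_distinct` and `permute_dynamic` are hypothetical, the arrays are assumed not to overlap in memory, and the parallel version is not shown (the sketch runs the sequential loop and reports which version the test would have selected).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Run-time test: true if no two of b[0..n-1] are equal, so the writes
 * a[b[i]] cannot collide. Every index must lie in [0, m). The test is
 * conservative: it also answers false if it cannot allocate scratch space. */
bool indices_distinct(const int *b, size_t n, size_t m)
{
    bool ok = true;
    unsigned char *seen = calloc(m ? m : 1, 1);

    if (seen == NULL)
        return false;
    for (size_t i = 0; i < n && ok; i++) {
        assert(b[i] >= 0 && (size_t)b[i] < m);
        if (seen[b[i]])
            ok = false;
        else
            seen[b[i]] = 1;
    }
    free(seen);
    return ok;
}

/* Returns true if the parallel version would have been selected. A real
 * implementation would distribute the iterations in that case; this
 * sketch always executes the sequential loop. */
bool permute_dynamic(int *a, size_t m, const int *b, const int *c, size_t n)
{
    bool parallel = indices_distinct(b, n, m);

    for (size_t i = 0; i < n; i++)
        a[b[i]] = c[i];
    return parallel;
}
```

The test costs a full pass over B before any useful work starts, whereas the optimistic scheme pays only when a conflict actually occurs.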

